-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix/remove duplicates in rcm #13
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resulting matrices are good, thanks!
Some small code requests.
def generate_gene_names(self, url): | ||
file_name = os.path.basename(urlparse(url).path) | ||
self.gene_names = self.download_gff(url, file_name) | ||
self.num_genes = len(self.gene_names) | ||
if not hasattr(self, 'gene_names') or not self.gene_names: | ||
self.download_gff(url, file_name) | ||
return self.gene_names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that this function is not used anymore, since process_gff
sets self.gene_names
. RM
@@ -45,23 +45,32 @@ def load_sample_info(self, json_file_path): | |||
self.treatments = [item['Treatment'] for item in sample_info] | |||
self.experiment_id = [item['ExperimentID'] for item in sample_info] | |||
|
|||
def download_gff(self, url, file_path): | |||
def download_gff(self, url, file_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would rename to download_and_process_gff
, since it does more than just download now.
|
||
self.gene_names = gene_info['GeneName'].tolist() | ||
self.num_genes = len(self.gene_names) | ||
return output_file_csv, file_path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return values are not used
if not os.path.exists(file_path): | ||
subprocess.run(['wget', '-O', file_path, url], check=True) | ||
self.process_gff(file_path) | ||
|
||
def process_gff(self, file_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would rename to something like write_gene_info
, since the function now saves the gene lengths in a csv.
print(f"Gene lengths have been saved to {output_file_csv}.") | ||
|
||
self.gene_names = gene_info['GeneName'].tolist() | ||
self.num_genes = len(self.gene_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this kind of situation, when you need to hold a list of values (gene_names
), and make decisions based the length of the list, it is best to simply get the length directly from the property when you need it. With self.gene_names
, all we need to get the number of genes is len(self.gene_names)
.
Otherwise, you have to maintain the state of 2 closely linked properties instead of just 1, which is not ideal for maintainability.
if not self.gene_names: | ||
raise ValueError("No gene names available for count matrix generation.") | ||
|
||
unique_gene_names = list(filter(None, set(self.gene_names))) | ||
if not unique_gene_names: | ||
raise ValueError("No valid gene names available after filtering duplicates and empty entries.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is needed, since the values are already deduplicated when the self.gene_names
property is set.
genes.loc[:, 'length'] = genes['end'].astype(int) - genes['start'].astype(int) + 1 | ||
genes = genes['GeneName'].dropna().tolist() | ||
return genes | ||
gene_info = genes[['GeneName', 'length']].dropna().drop_duplicates(subset='GeneName') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Columns should be GeneID
and GeneLength
for TDS
Code updates to refresh the GFF3 release, and code refactoring to prevent duplicate genId entries in the generated transcriptomic raw count matrix.